Applications of PCA and MDS

In this homework, PCA and MDS are used to understand trends in betting odds and to reduce complexity in images.

Task1.a Can over/under 2.5 game results be explained by the odds for different types of bets using PCA?

The homework requires specific packages, so these packages are loaded first.

library(data.table)
library(MASS)
library(plotly)
library(jpeg)
library(imager) 

To start the analysis, the data is read from the RDS files and converted to data tables.

DT_odd_details=data.table(readRDS("/Users/alifurkanyavuz/Downloads/odd_details.rds"))
DT_matches=data.table(readRDS("/Users/alifurkanyavuz/Downloads/matches.rds"))

We try to predict the game results from the odds. First, five bookmakers are selected for the analysis. The data is then ordered and only the final odds are kept for prediction, because they are the most accurate. In addition, only total handicap values NA and 2.5 are used, since we try to predict over/under 2.5, and the analysis is based only on the specific bet types "ou", "1x2", "bts" and "dc".

DT_odd_details_5=DT_odd_details[bookmaker %in% c("bet365","1xBet","10Bet","Betclic","BetVictor")]

DT_odd_details_5=DT_odd_details_5[order(matchId,oddtype,bookmaker,date)]

DT_odd_details_5_final=DT_odd_details_5[,list(final_odd=odd[.N]), 
                                        by=list(matchId,oddtype,bookmaker,betType,totalhandicap)]
DT_odd_details_5_final=DT_odd_details_5_final[totalhandicap %in% c(NA,2.5)]

DT_odd_details_5_final=DT_odd_details_5_final[betType %in% c("ou","1x2","bts","dc")]

The data is cast into wide format and rows containing NA values are deleted.

wide_odd_detail_5=dcast(DT_odd_details_5_final,matchId~bookmaker+betType+oddtype+totalhandicap,
            value.var = 'final_odd')

wide_odd_detail_5=wide_odd_detail_5[complete.cases(wide_odd_detail_5)]
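As a quick illustration of what the `dcast` call above does, here is a minimal sketch on toy data (the column names and values are made up for the example):

```r
library(data.table)

# Toy long-format odds: one row per (matchId, bookmaker, oddtype)
toy = data.table(matchId   = c("m1","m1","m2","m2"),
                 bookmaker = c("bet365","bet365","bet365","bet365"),
                 oddtype   = c("over","under","over","under"),
                 final_odd = c(1.85, 1.95, 2.10, 1.70))

# Cast to wide: one row per match, one column per bookmaker/oddtype combination
dcast(toy, matchId ~ bookmaker + oddtype, value.var = "final_odd")
# one row per match, with columns matchId, bet365_over, bet365_under
```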

We need the match data to get the match results, so it is prepared for the analysis and rows with NA values are deleted. Then the merge operation is executed.

DT_matches[,c("HomeScore","AwayScore"):= tstrsplit(score,":")]
DT_matches[,HomeScore:=as.numeric(HomeScore)]
DT_matches[,AwayScore:=as.numeric(AwayScore)]
DT_matches[,Sum_score:=(HomeScore+AwayScore)]

DT_matches[,IsOver:=0]
DT_matches[Sum_score>2,IsOver:=1]
DT_matches[,who_win:=0]
DT_matches[HomeScore>AwayScore, who_win:=1]
DT_matches[HomeScore<AwayScore, who_win:=2]

DT_matches=DT_matches[complete.cases(DT_matches)]
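The score parsing above can be checked on a toy example; `tstrsplit` splits each score string into separate columns, and `IsOver` flags totals above 2.5:

```r
library(data.table)

toy = data.table(score = c("2:1", "0:3", "1:1"))
toy[, c("HomeScore","AwayScore") := tstrsplit(score, ":")]
toy[, `:=`(HomeScore = as.numeric(HomeScore), AwayScore = as.numeric(AwayScore))]
toy[, IsOver := as.integer(HomeScore + AwayScore > 2)]
toy
#    score HomeScore AwayScore IsOver
# 1:   2:1         2         1      1
# 2:   0:3         0         3      1
# 3:   1:1         1         1      0
```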

wide_odd_detail_5_merged=merge(DT_matches,wide_odd_detail_5,by="matchId") 
wide_odd_detail_5_merged=wide_odd_detail_5_merged[order(matchId)]

Before PCA the data needs to be scaled. Before scaling, some columns are stored in another table, because the scale operation can only be applied to numeric values. After scaling, the match result information is added back to the data.

first_12_column=wide_odd_detail_5_merged[,c(1:13)]
wide_odd_detail_5_scaled=as.data.table(scale(wide_odd_detail_5_merged[,c(13:60)]))

wide_odd_detail_5_new_1=wide_odd_detail_5_scaled[,IsOver_1:=first_12_column[,IsOver]]
wide_odd_detail_5_new=wide_odd_detail_5_scaled[,winnig_team:=first_12_column[,who_win]]

The PCA is executed. As you can see, the first component explains almost 50 percent of the variation in the data and the second component explains almost 30 percent, which means the dimension reduction is effective. The components after the second cannot explain enough of the variation to be taken into consideration. Getting enough information with fewer variables is exactly what is desired.

odds_pca=princomp(wide_odd_detail_5_new[,1:48],cor = TRUE)
summary(odds_pca)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4
## Standard deviation     4.8418597 3.7028821 2.9535183 0.89210382
## Proportion of Variance 0.4884084 0.2856528 0.1817348 0.01658019
## Cumulative Proportion  0.4884084 0.7740613 0.9557961 0.97237625
##                             Comp.5      Comp.6      Comp.7     Comp.8
## Standard deviation     0.523018536 0.414719858 0.390599090 0.29636355
## Proportion of Variance 0.005698925 0.003583178 0.003178493 0.00182982
## Cumulative Proportion  0.978075177 0.981658355 0.984836848 0.98666667
##                             Comp.9     Comp.10      Comp.11      Comp.12
## Standard deviation     0.281777635 0.252483422 0.2098685768 0.2072030238
## Proportion of Variance 0.001654138 0.001328081 0.0009176004 0.0008944394
## Cumulative Proportion  0.988320806 0.989648887 0.9905664874 0.9914609268
##                             Comp.13      Comp.14      Comp.15      Comp.16
## Standard deviation     0.1962136220 0.1792599140 0.1778959808 0.1711719372
## Proportion of Variance 0.0008020789 0.0006694608 0.0006593121 0.0006104132
## Cumulative Proportion  0.9922630057 0.9929324664 0.9935917785 0.9942021917
##                             Comp.17      Comp.18      Comp.19      Comp.20
## Standard deviation     0.1650935078 0.1607742416 0.1456378877 0.1413199604
## Proportion of Variance 0.0005678305 0.0005385074 0.0004418832 0.0004160694
## Cumulative Proportion  0.9947700222 0.9953085297 0.9957504129 0.9961664823
##                             Comp.21      Comp.22      Comp.23     Comp.24
## Standard deviation     0.1364156836 0.1287630178 0.1224145432 0.110443252
## Proportion of Variance 0.0003876925 0.0003454149 0.0003121942 0.000254119
## Cumulative Proportion  0.9965541747 0.9968995896 0.9972117838 0.997465903
##                             Comp.25      Comp.26      Comp.27      Comp.28
## Standard deviation     0.1090320618 0.1081304781 0.1007479426 0.0933262292
## Proportion of Variance 0.0002476665 0.0002435875 0.0002114614 0.0001814539
## Cumulative Proportion  0.9977135693 0.9979571568 0.9981686182 0.9983500721
##                             Comp.29      Comp.30      Comp.31      Comp.32
## Standard deviation     0.0910026136 0.0874038148 0.0850777681 0.0818592006
## Proportion of Variance 0.0001725307 0.0001591547 0.0001507964 0.0001396027
## Cumulative Proportion  0.9985226028 0.9986817575 0.9988325539 0.9989721566
##                             Comp.33      Comp.34      Comp.35      Comp.36
## Standard deviation     0.0751951365 0.0747104647 0.0701459628 6.824514e-02
## Proportion of Variance 0.0001177981 0.0001162844 0.0001025095 9.702915e-05
## Cumulative Proportion  0.9990899547 0.9992062391 0.9993087486 9.994058e-01
##                             Comp.37      Comp.38      Comp.39      Comp.40
## Standard deviation     0.0642969717 6.227901e-02 5.908809e-02 5.628847e-02
## Proportion of Variance 0.0000861271 8.080572e-05 7.273755e-05 6.600817e-05
## Cumulative Proportion  0.9994919049 9.995727e-01 9.996454e-01 9.997115e-01
##                             Comp.41      Comp.42      Comp.43      Comp.44
## Standard deviation     0.0549176242 5.276330e-02 4.600461e-02 4.379872e-02
## Proportion of Variance 0.0000628322 5.799929e-05 4.409217e-05 3.996516e-05
## Cumulative Proportion  0.9997742885 9.998323e-01 9.998764e-01 9.999163e-01
##                             Comp.45      Comp.46      Comp.47      Comp.48
## Standard deviation     3.456758e-02 3.357467e-02 3.110836e-02 2.693559e-02
## Proportion of Variance 2.489412e-05 2.348456e-05 2.016105e-05 1.511512e-05
## Cumulative Proportion  9.999412e-01 9.999647e-01 9.999849e-01 1.000000e+00
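The proportions in the summary can be reproduced directly from the component standard deviations, which is a useful sanity check; a scree plot also makes the drop after the first few components visible (a small sketch, assuming `odds_pca` from above):

```r
# Proportion of variance = squared standard deviation over total variance
prop_var = odds_pca$sdev^2 / sum(odds_pca$sdev^2)
round(prop_var[1:3], 3)   # should match Comp.1-Comp.3 in the summary above

# Scree plot of the first ten components
plot(prop_var[1:10], type = "b", xlab = "Component",
     ylab = "Proportion of variance", main = "Scree plot")
```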

The coordinates of the scores are stored for plotting and then the plot is created. As you can see from the graph, the black and yellow points do not show a pattern, which means that neither of the principal components carries information on the over/under 2.5 match results.

getcoordinates = odds_pca$scores
coordinates_dataframe=as.data.frame(getcoordinates)
x_axes=getcoordinates[,1]
y_axes=getcoordinates[,2]
z_axes=getcoordinates[,3]
plot_ly(coordinates_dataframe,x=x_axes,y=y_axes,color=wide_odd_detail_5_new$IsOver_1,colors=c('black','yellow'),type = "scatter",mode='markers')

Task1.b Can over/under 2.5 game results be explained by the odds for different types of bets using MDS?

The data is already scaled, but only the odds columns are needed while the table also includes match result data, so the data is filtered.

wide_odd_detail_MDS_extract_scaled=wide_odd_detail_5_new[,1:48]

First, two distance matrices are calculated, using the Euclidean (default) and Manhattan distances. Then MDS is executed for both options.

d=dist(wide_odd_detail_MDS_extract_scaled)

mds_euclidean<- cmdscale(d,eig=TRUE, k=2)

d2=dist(wide_odd_detail_MDS_extract_scaled,method = "manhattan")

mds_manhattan <- cmdscale(d2,eig=TRUE, k=2)

Then the graphs are plotted. The dimension of the data is reduced, but the stored information does not reveal the over/under 2.5 result. Note: the MDS graphs use row labels instead of color coding, so we can still compare the data with the PCA result. After the analysis we can say that they are almost the same.

plot(mds_euclidean$points[,1], mds_euclidean$points[,2], xlab="Coordinate 1", ylab="Coordinate 2", 
     main="Euclidean    MDS",   type="n")
text(mds_euclidean$points[,1], mds_euclidean$points[,2], labels = row.names(wide_odd_detail_MDS_extract_scaled), cex=.7)

plot(mds_manhattan$points[,1], mds_manhattan$points[,2], xlab="Coordinate 1_2", ylab="Coordinate 2_2", 
     main="Manhattan    MDS",   type="n")
text(mds_manhattan$points[,1], mds_manhattan$points[,2], labels = row.names(wide_odd_detail_MDS_extract_scaled), cex=.7)

Task1.c Comparison of PCA and MDS

PCA and MDS gave almost the same result because the idea behind them is the same: they try to explain the big picture with fewer variables. However, MDS using the Manhattan distance shows deviations compared to the Euclidean version.
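This equivalence can be demonstrated directly: classical MDS on Euclidean distances of centered data recovers the PCA scores up to a sign flip per axis. A small sketch on random toy data (not the odds data) illustrating the point:

```r
set.seed(1)
X = scale(matrix(rnorm(100 * 5), ncol = 5))   # centered toy data

pca = prcomp(X)                    # PCA scores
mds = cmdscale(dist(X), k = 2)     # classical MDS on Euclidean distances

# Coordinates agree up to a possible sign flip per axis
max(abs(abs(pca$x[, 1]) - abs(mds[, 1])))   # ~ 0 (numerical tolerance)
max(abs(abs(pca$x[, 2]) - abs(mds[, 2])))   # ~ 0
```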

Task2 Can home/tie/away results be explained by PCA?

As you can see from the graph, the Tie and Away results can be explained using PCA, because they are scattered along different principal components, meaning those components capture the result. The Home points (red), however, are scattered everywhere, so the Home result cannot be explained via PCA.

plot_ly(coordinates_dataframe,x=x_axes,y=y_axes,z=z_axes,color=wide_odd_detail_5_new$winnig_team,colors=c('red','blue','green'),type = "scatter3d",mode='markers')

Task 3

We will use the Cim Adam (Grass Man) photo to perform image operations using PCA.

Task3.1

The image is read using the readJPEG function.

cim_adam=readJPEG("/Users/alifurkanyavuz/Downloads/cim_adam.jpg", native=FALSE)

Task3.2

The structure and dimensions are checked.

is.array(cim_adam)
## [1] TRUE
dim(cim_adam)
## [1] 64 64  3

Task3.2.a

Displaying the image.

plot(c(0, 64), c(0, 64),type = "n", xlab = "", ylab = "",axes = FALSE)
rasterImage(cim_adam,0,0,64,64)

The channels are extracted and stored in matrices to display later. Each channel is extracted in the same way.

matrix1=cim_adam[,,1]
matrix2=cim_adam[,,2]
matrix3=cim_adam[,,3]

Displaying the extracted channels on a single plot

par(mfrow=c(3,1)) 
image(matrix1,useRaster=TRUE, axes=FALSE)
image(matrix2,useRaster=TRUE, axes=FALSE)
image(matrix3,useRaster=TRUE, axes=FALSE)

par(mfrow=c(1,1)) 

Task3.3

A noise array is created and added to the image. The noisy image is then normalized for display, because the values must be between 0 and 1, otherwise it cannot be displayed.

noise_image<- array(runif(64*64*3,min = 0,max=0.1), dim=c(64,64,3))
noisy_image=cim_adam+noise_image
noisy_image=(noisy_image-min(noisy_image))/(max(noisy_image)-min(noisy_image))
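The min-max rescaling above can be wrapped in a small helper so the same normalization can be reused for the score and eigenvector matrices later (a convenience sketch, not part of the original code; the name `normalize01` is made up):

```r
# Rescale any numeric array to [0, 1] so rasterImage can display it
normalize01 = function(x) (x - min(x)) / (max(x) - min(x))

m = matrix(c(-2, 0, 1, 4), 2, 2)
range(normalize01(m))   # 0 1
```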

Task3.3.a Displaying noisy image.

plot(c(0, 64), c(0, 64),type = "n", xlab = "", ylab = "")
rasterImage(noisy_image,0,0,64,64)

Extracting channels from noisy image

matrix4=noisy_image[,,1]

matrix5=noisy_image[,,2]

matrix6=noisy_image[,,3]

Task3.3.b Displaying the extracted channels of the noisy image on a single plot

par(mfrow=c(3,1)) 
image(matrix4,useRaster=TRUE, axes=FALSE)
image(matrix5,useRaster=TRUE, axes=FALSE)
image(matrix6,useRaster=TRUE, axes=FALSE)

par(mfrow=c(1,1)) 

Task 3.4.a Applying PCA to the Cim Adam(Grass Man)

The noisy image is converted to grayscale and normalized.

bar<- noisy_image[,,1]+noisy_image[,,2]+noisy_image[,,3]
bar <- bar/max(bar)

The grayscale image is turned into a data frame to make the operations easier; then column and row names are assigned.

a=as.data.frame(bar)
b=c(1:64)
colnames(a)[1:64]=b
rownames(a)[1:64]=b

A matrix is created to store the patches, and the 3x3 patches are then extracted.

patch_cim_adam=matrix(data = NA,9,0)

for(j in 1:62)
  for (i in 1:62) {
    temp_var=a[i:(i+2),j:(j+2)]
    temp_var=unlist(temp_var)
    patch_cim_adam=cbind(patch_cim_adam,(temp_var))
  }
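A quick sanity check confirms the loop produced one 9-pixel column per 3x3 patch position, and that the first column equals the top-left block of the image:

```r
# 62 * 62 = 3844 overlapping 3x3 patches, 9 pixels each
dim(patch_cim_adam)                          # 9 3844
stopifnot(ncol(patch_cim_adam) == 62 * 62)

# The first patch should be the top-left 3x3 block of the grayscale image
all.equal(as.numeric(patch_cim_adam[, 1]),
          as.numeric(unlist(a[1:3, 1:3])))   # TRUE
```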

The data is scaled and transposed to perform PCA.

patch_cim_adam=data.frame(scale(patch_cim_adam))
patch_cim_adam=transpose(patch_cim_adam)

PCA is performed. As expected, none of the components explains the variance on its own: the image values do not follow a single distribution, since the image represents an object with different attributes.

pca_cimadam=princomp(patch_cim_adam,cor=TRUE)
summary(pca_cimadam)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     1.3610363 1.2229900 1.1740194 0.9963410 0.9507518
## Proportion of Variance 0.2058244 0.1661894 0.1531468 0.1102995 0.1004366
## Cumulative Proportion  0.2058244 0.3720138 0.5251607 0.6354601 0.7358967
##                            Comp.6     Comp.7     Comp.8       Comp.9
## Standard deviation     0.93237895 0.87891900 0.85738007 1.490116e-08
## Proportion of Variance 0.09659228 0.08583318 0.08167784 2.467162e-17
## Cumulative Proportion  0.83248898 0.91832216 1.00000000 1.000000e+00

Task3.4.b

To reconstruct the image based on the components, the scores are extracted. The scores of each of the first three components are reshaped into 62x62 matrices in the same way and normalized for display.

pca_score=pca_cimadam$scores
PCA1_matrix=matrix(pca_score[,1],nrow = 62,ncol = 62)
PCA1_matrix=(PCA1_matrix-min(PCA1_matrix))/(max(PCA1_matrix)-min(PCA1_matrix))
PCA2_matrix=matrix(pca_score[,2],nrow = 62,ncol = 62)
PCA2_matrix=(PCA2_matrix-min(PCA2_matrix))/(max(PCA2_matrix)-min(PCA2_matrix))
PCA3_matrix=matrix(pca_score[,3],nrow = 62,ncol = 62)
PCA3_matrix=(PCA3_matrix-min(PCA3_matrix))/(max(PCA3_matrix)-min(PCA3_matrix))

Plotting the scores as images for the first three components.

plot(c(0, 62), c(0, 62),type = "n", main="PCA1",xlab = "", ylab = "",axes = FALSE)
rasterImage(PCA1_matrix,0,0,62,62)

plot(c(0, 62), c(0, 62),type = "n", main="PCA2", xlab = "", ylab = "",axes = FALSE)
rasterImage(PCA2_matrix,0,0,62,62)

plot(c(0, 62), c(0, 62),type = "n",  main="PCA3",xlab = "", ylab = "",axes = FALSE)
rasterImage(PCA3_matrix,0,0,62,62)

Task3.4.c

The correlation matrix is calculated and its eigenvectors are extracted. Then a 3x3 matrix is created from each vector and normalized for display. The operation is the same for every vector.

cim_adam_cor=cor(patch_cim_adam)
cim_eigen=eigen(cim_adam_cor)
eigen1_matrix=matrix(cim_eigen$vectors[,1],3,3)
eigen1_matrix=(eigen1_matrix-min(eigen1_matrix))/(max(eigen1_matrix)-min(eigen1_matrix))
eigen2_matrix=matrix(cim_eigen$vectors[,2],3,3)
eigen2_matrix=(eigen2_matrix-min(eigen2_matrix))/(max(eigen2_matrix)-min(eigen2_matrix))
eigen3_matrix=matrix(cim_eigen$vectors[,3],3,3)
eigen3_matrix=(eigen3_matrix-min(eigen3_matrix))/(max(eigen3_matrix)-min(eigen3_matrix))
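Since princomp was already run on the same patches, the eigenvectors of the correlation matrix can be cross-checked against its loadings; they should agree up to sign (a small check, assuming `pca_cimadam` and `cim_eigen` from above):

```r
# princomp loadings are the eigenvectors of the correlation matrix (up to sign)
all.equal(abs(as.numeric(pca_cimadam$loadings[, 1])),
          abs(cim_eigen$vectors[, 1]))
# should be TRUE up to numerical tolerance
```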

Then the vectors are plotted.

plot(c(0, 3), c(0, 3),type = "n", main="PCA1",xlab = "", ylab = "",axes = FALSE)
rasterImage(eigen1_matrix,0,0,3,3)

plot(c(0, 3), c(0, 3),type = "n", main="PCA2",xlab = "", ylab = "",axes = FALSE)
rasterImage(eigen2_matrix,0,0,3,3)

plot(c(0, 3), c(0, 3),type = "n",main="PCA3", xlab = "", ylab = "",axes = FALSE)
rasterImage(eigen3_matrix,0,0,3,3)